Approximate Stream Analytics in Apache Flink and Apache Spark Streaming

نویسندگان

  • Do Le Quoc
  • Ruichuan Chen
  • Pramod Bhatotia
  • Christof Fetzer
  • Volker Hilt
  • Thorsten Strufe
چکیده

Approximate computing aims for efficient execution of workflows where an approximate output is sufficient instead of the exact output. The idea behind approximate computing is to compute over a representative sample instead of the entire input dataset. Thus, approximate computing — based on the chosen sample size — can make a systematic trade-off between the output accuracy and computation efficiency. Unfortunately, the state-of-the-art systems for approximate computing primarily target batch analytics, where the input data remains unchanged during the course of sampling. Thus, they are not well-suited for stream analytics. This motivated the design of StreamApprox— a stream analytics system for approximate computing. To realize this idea, we designed an online stratified reservoir sampling algorithm to produce approximate outputwith rigorous error bounds. Importantly, our proposed algorithm is generic and can be applied to two prominent types of stream processing systems: (1) batched stream processing such asApache Spark Streaming, and (2) pipelined stream processing such as Apache Flink. To showcase the effectiveness of our algorithm,we implemented StreamApprox as a fully functional prototype based on Apache Spark Streaming and Apache Flink. We evaluated StreamApprox using a set of microbenchmarks and real-world case studies. Our results show that Sparkand Flink-based StreamApprox systems achieve a speedup of 1.15×—3× compared to the respective native Spark Streaming and Flink executions, with varying sampling fraction of 80% to 10%. Furthermore, we have also implemented an improved baseline in addition to the native execution baseline — a Spark-based approximate computing system leveraging the existing sampling modules in Apache Spark. Compared to the improved baseline, our results show that StreamApprox achieves a speedup 1.1×—2.4× while maintaining the same accuracy level. This technical report is an extended version of our conference publication [39].

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Benchmarking Distributed Stream Data Processing Systems

The need for scalable and efficient stream analysis has led to the development of many open-source streaming data processing systems (SDPSs) with highly diverging capabilities and performance characteristics. While first initiatives try to compare the systems for simple workloads, there is a clear gap of detailed analyses of the systems’ performance characteristics. In this paper, we propose a ...

متن کامل

On the usability of Hadoop MapReduce, Apache Spark & Apache flink for data science

Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level, requiring many implementation steps even for simple analysis tasks. This has led to the development of advanced dataflow oriented platforms, most prominently...

متن کامل

Reproducible Experiments for Comparing Apache Flink and Apache Spark on Public Clouds

Big data processing is a hot topic in today’s computer science world. There is a significant demand for analysing big data to satisfy many requirements of many industries. Emergence of the Kappa architecture created a strong requirement for a highly capable and efficient data processing engine. Therefore data processing engines such as Apache Flink and Apache Spark emerged in open source world ...

متن کامل

A comparison on scalability for batch big data processing on Apache Spark and Apache Flink

*Correspondence: [email protected] 1Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, Calle Periodista Daniel Saucedo Aranda, 18071 Granada, Spain Full list of author information is available at the end of the article Abstract The large amounts of data have created a need for new fram...

متن کامل

A Reference Architecture and Road map for Enabling E- commerce on Apache Spark

Apache Spark is an execution engine that besides working as an isolated distributed, in-memory computing engine also offers close integration with Hadoop’s distributed file system (HDFS). Apache Spark's underlying appeal is in providing a unified framework to create sophisticated applications involving workloads. It unifies multiple workloads, handles unstructured data very well and has easy-to...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1709.02946  شماره 

صفحات  -

تاریخ انتشار 2017